Abstract


The Measuring Network Quality for End-Users workshop was held virtually by the Internet 
Architecture Board (IAB) on September 14-16, 2021. This report summarizes the workshop, the 
topics discussed, and some preliminary conclusions drawn at the end of the workshop. 


Note that this document is a report on the proceedings of the workshop. The views and positions 
documented in this report are those of the workshop participants and do not necessarily reflect 
IAB views and positions.


1. Introduction


The Internet Architecture Board (IAB) holds occasional workshops designed to consider long- 
term issues and strategies for the Internet, and to suggest future directions for the Internet 
architecture. This long-term planning function of the IAB is complementary to the ongoing 
engineering efforts performed by working groups of the Internet Engineering Task Force (IETF). 


The Measuring Network Quality for End-Users workshop [WORKSHOP] was held virtually by the 
Internet Architecture Board (IAB) on September 14-16, 2021. This report summarizes the 
workshop, the topics discussed, and some preliminary conclusions drawn at the end of the 
workshop. 


1.1. Problem Space 


The Internet in 2021 is quite different from what it was 10 years ago. Today, it is a crucial part of
everyone's daily life. People use the Internet for their social life, for their daily jobs, for routine
shopping, and for keeping up with major events. An increasing number of people can access a
gigabit connection, which would have been hard to imagine a decade ago. Additionally, thanks to
improvements in security, people trust the Internet for financial banking transactions,
purchasing goods, and everyday bill payments.


At the same time, some aspects of the end-user experience have not improved as much. Many 
users have typical connection latencies that remain at decade-old levels. Despite significant 
reliability improvements in data center environments, end users also still often see interruptions 
in service. Despite algorithmic advances in the field of control theory, one still finds that the
queuing delays in the last-mile equipment exceed the accumulated transit delays. Transport
improvements, such as QUIC, Multipath TCP, and TCP Fast Open, are still not fully supported in 
some networks. Likewise, various advances in the security and privacy of user data are not 
widely supported, such as encrypted DNS to the local resolver. 


One of the major factors behind this lack of progress is the popular perception that throughput
is often the sole measure of the quality of Internet connectivity. To look beyond this narrow
focus, the Measuring Network Quality for End-Users workshop aimed to discuss various topics:


* What is user latency under typical working conditions?

* How reliable is connectivity across longer time periods?

* Do networks allow the use of a broad range of protocols?

* What services can be run by network clients?

* What kind of IPv4, NAT, or IPv6 connectivity is offered, and are there firewalls?

* What security mechanisms are available for local services, such as DNS?

* To what degree are the privacy, confidentiality, integrity, and authenticity of user
communications guarded?


Improving these aspects of network quality will likely depend on measuring and exposing
metrics in a meaningful way to all involved parties, including to end users. Such
measurement and exposure of the right metrics will allow service providers and network
operators to concentrate focus on their users' experience and will simultaneously empower
users to choose the Internet Service Providers (ISPs) that can deliver the best experience
based on their needs.


* What are the fundamental properties of a network that contribute to a good user
experience?

* What metrics quantify these properties, and how can we collect such metrics in a practical
way?

* What are the best practices for interpreting those metrics and incorporating them in a
decision-making process?

* What are the best ways to communicate these properties to service providers and network
operators?

* How can these metrics be displayed to users in a meaningful way?


2. Workshop Agenda


The Measuring Network Quality for End-Users workshop was divided into the following main 
topic areas; see further discussion in Sections 4 and 5: 


* Introduction overviews and a keynote by Vint Cerf

* Metrics considerations

* Cross-layer considerations

* Synthesis

* Group conclusions


3. Position Papers 


The following position papers were received for consideration by the workshop attendees. The 
workshop's web page [WORKSHOP] contains archives of the papers, presentations, and recorded 
videos. 


* Ahmed Aldabbagh. "Regulatory perspective on measuring network quality for end users"
[Aldabbagh2021]

* Al Morton. "Dream-Pipe or Pipe-Dream: What Do Users Want (and how can we assure it)?"
[Morton2021]

* Alexander Kozlov. "The 2021 National Internet Segment Reliability Research"

* Anna Brunstrom. "Measuring network quality - the MONROE experience"

* Bob Briscoe, Greg White, Vidhi Goel, and Koen De Schepper. "A Single Common Metric to
Characterize Varying Packet Delay" [Briscoe2021]

* Brandon Schlinker. "Internet Performance from Facebook's Edge" [Schlinker2019]

* Christoph Paasch, Kristen McIntyre, Randall Meyer, Stuart Cheshire, and Omer Shapira. "An
end-user approach to the Internet Score" [McIntyre2021]

* Christoph Paasch, Randall Meyer, Stuart Cheshire, and Omer Shapira. "Responsiveness
under Working Conditions" [Paasch2021]

* Dave Reed and Levi Perigo. "Measuring ISP Performance in Broadband America: A Study of
Latency Under Load" [Reed2021]

* Eve M. Schooler and Rick Taylor. "Non-traditional Network Metrics"

* Gino Dion. "Focusing on latency, not throughput, to provide better internet experience and
network quality" [Dion2021]

* Gregory Mirsky, Xiao Min, Gyan Mishra, and Liuyan Han. "The error performance metric in
a packet-switched network" [Mirsky2021]

* Jana Iyengar. "The Internet Exists In Its Use" [Iyengar2021]

* Jari Arkko and Mirja Kuehlewind. "Observability is needed to improve network quality"
[Arkko2021]

* Joachim Fabini. "Network Quality from an End User Perspective" [Fabini2021]

* Jonathan Foulkes. "Metrics helpful in assessing Internet Quality" [Foulkes2021]

* Kalevi Kilkki and Benjamin Finley. "In Search of Lost QoS" [Kilkki2021]

* Karthik Sundaresan, Greg White, and Steve Glennon. "Latency Measurement: What is
latency and how do we measure it?"

* Keith Winstein. "Five Observations on Measuring Network Quality for Users of Real-Time
Media Applications"

* Ken Kerpez, Jinous Shafiei, John Cioffi, Pete Chow, and Djamel Bousaber. "Wi-Fi and
Broadband Data" [Kerpez2021]

* Kenjiro Cho. "Access Network Quality as Fitness for Purpose"

* Koen De Schepper, Olivier Tilmans, and Gino Dion. "Challenges and opportunities of
hardware support for Low Queuing Latency without Packet Loss" [DeSchepper2021]

* Kyle MacMillian and Nick Feamster. "Beyond Speed Test: Measuring Latency Under Load
Across Different Speed Tiers" [MacMillian2021]

* Lucas Pardue and Sreeni Tellakula. "Lower-layer performance not indicative of upper-layer
success" [Pardue2021]

* Matt Mathis. "Preliminary Longitudinal Study of Internet Responsiveness" [Mathis2021]

* Michael Welzl. "A Case for Long-Term Statistics" [Welzl2021]

* Mikhail Liubogoshchev. "Cross-layer Cooperation for Better Network Service"
[Liubogoshchev2021]

* Mingrui Zhang, Vidhi Goel, and Lisong Xu. "User-Perceived Latency to Measure CCAs"
[Zhang2021]

* Neil Davies and Peter Thompson. "Measuring Network Impact on Application Outcomes
Using Quality Attenuation" [Davies2021]

* Olivier Bonaventure and Francois Michel. "Packet delivery time as a tie-breaker for assessing
Wi-Fi access points" [Michel2021]

* Pedro Casas. "10 Years of Internet-QoE Measurements. Video, Cloud, Conferencing, Web and
Apps. What do we Need from the Network Side?" [Casas2021]

* Praveen Balasubramanian. "Transport Layer Statistics for Network Quality"
[Balasubramanian2021]

* Rajat Ghai. "Using TCP Connect Latency for measuring CX and Network Optimization"
[Ghai2021]

* Robin Marx and Joris Herbots. "Merge Those Metrics: Towards Holistic (Protocol) Logging"
[Marx2021]

* Sandor Laki, Szilveszter Nadas, Balazs Varga, and Luis M. Contreras. "Incentive-Based Traffic
Management and QoS Measurements" [Laki2021]

* Satadal Sengupta, Hyojoon Kim, and Jennifer Rexford. "Fine-Grained RTT Monitoring Inside
the Network" [Sengupta2021]

* Stuart Cheshire. "The Internet is a Shared Network" [Cheshire2021]

* Toerless Eckert and Alex Clemm. "network-quality-eckert-clemm-00.4"

* Vijay Sivaraman, Sharat Madanapalli, and Himal Kumar. "Measuring Network Experience
Meaningfully, Accurately, and Scalably" [Sivaraman2021]

* Yaakov (J) Stein. "The Futility of QoS" [Stein2021]


4. Workshop Topics and Discussion 


The agenda for the three-day workshop was broken into four separate sections that each played 
a role in framing the discussions. The workshop started with a series of introduction and 
problem space presentations (Section 4.1), followed by metrics considerations (Section 4.2), cross- 
layer considerations (Section 4.3), and a synthesis discussion (Section 4.4). After the four 
subsections concluded, a follow-on discussion was held to draw conclusions that could be agreed 
upon by workshop participants (Section 5). 


4.1. Introduction and Overviews 


The workshop started with a broad focus on the state of user Quality of Service (QoS) and Quality 
of Experience (QoE) on the Internet today. The goal of the introductory talks was to set the stage 
for the workshop by describing both the problem space and the current solutions in place and 
their limitations. 


The introduction presentations provided views of existing QoS and QoE measurements and their 
effectiveness. Also discussed was the interaction between multiple users within the network, as 
well as the interaction between multiple layers of the OSI stack. Vint Cerf provided a keynote 
describing the history and importance of the topic. 


4.1.1. Key Points from the Keynote by Vint Cerf 


We may be operating in a networking space with dramatically different parameters compared to 
30 years ago. This differentiation justifies reconsidering not only the importance of one metric 
over the other but also reconsidering the entire metaphor. 


It is time for the experts to look at not only adjusting TCP but also exploring other protocols, as
has been done lately with QUIC. It's important that we feel free to consider alternatives to TCP.
TCP is not a teddy bear, and one should not be afraid to replace it with a transport layer with
better properties that better benefit its users.


A suggestion: we should consider exercises to identify desirable properties. As we are looking at 
the parametric spaces, one can identify "desirable properties", as opposed to "fundamental 
properties", for example, a low-latency property. An example coming from the Advanced 
Research Projects Agency (ARPA): you want to know where the missile is now, not where it was. 
Understanding drives particular parameter creation and selection in the design space. 


When parameter values are pushed to extremes, such as connectiveness, alternative designs will
emerge. One case study of note is the interplanetary protocol, where "ping" is no longer
indicative of anything useful. While we look at responsiveness, we should not ignore
connectivity.


Unfortunately, maintaining backward compatibility is painful. The work on designing IPv6 so as
to transition from IPv4 could have been done better if backward compatibility had been
considered. It is too late for IPv6, but it is not too late to consider this issue for potential future
problems.


IPv6 is still not implemented fully everywhere. It's been a long road to deployment since starting
work in 1996, and we are still not there. In 1996, the thinking was that it was quite easy to 
implement IPv6, but that failed to hold true. In 1996, the dot-com boom began, where a lot of 
money was spent quickly, and the moment was not caught in time while the market expanded 
exponentially. This should serve as a cautionary tale. 


One last point: consider performance across multiple hops in the Internet. We've not seen many 
end-to-end metrics, as successfully developing end-to-end measurements across different 
network and business boundaries is quite hard to achieve. A good question to ask when 
developing new protocols is "will the new protocol work across multiple network hops?" 


Multi-hop networks are being gradually replaced by humongous, flat networks with sufficient 
connectivity between operators so that systems become 1 hop, or 2 hops at most, away from each 
other (e.g., Google, Facebook, and Amazon). The fundamental architecture of the Internet is 
changing. 


4.1.2. Introductory Talks 


The Internet is a shared network built on IP protocols using packet switching to interconnect
multiple autonomous networks. The Internet's departure from circuit-switching technologies
allowed it to scale beyond any other known network design. On the other hand, the lack of
in-network regulation made it difficult to ensure the best experience for every user.


As Internet use cases continue to expand, it becomes increasingly more difficult to predict which 
network characteristics correlate with better user experiences. Different application classes, e.g., 
video streaming and teleconferencing, can affect user experience in ways that are complex and 
difficult to measure. Internet utilization shifts rapidly during the course of each day, week, and 
year, which further complicates identifying key metrics capable of predicting a good user 
experience. 


QoS initiatives attempted to overcome these difficulties by strictly prioritizing different types of 
traffic. However, QoS metrics do not always correlate with user experience. The utility of the QoS 
metric is further limited by the difficulties in building solutions with the desired QoS 
characteristics. 


QoE initiatives attempted to integrate the psychological aspects of how quality is perceived and
create statistical models designed to optimize the user experience. Despite the high modeling
effort involved, the QoE approach proved beneficial in certain application classes. Unfortunately,
generalizing the models proved to be difficult, and the question of how different applications
affect each other when sharing the same network remains an open problem.


The industry's focus on giving the end user more throughput/bandwidth led to remarkable
advances. In many places around the world, a home user enjoys gigabit speeds to their ISP. This
is so remarkable that it would have been brushed off as science fiction a decade ago. However,
the focus on increased capacity came at the expense of neglecting another important core metric:
latency. As a result, end users whose experience is negatively affected by high latency were
advised to upgrade their equipment to get more throughput instead. [MacMillian2021] showed
that sometimes such an upgrade can lead to latency improvements, because the "value-priced"
data plans are oversold for economic reasons.


As the industry continued to give end users more throughput, while mostly neglecting latency 
concerns, application designs started to employ various latency and short service disruption 
hiding techniques. For example, a user's web browser performance experience is closely tied to 
the content in the browser's local cache. While such techniques can clearly improve the user 
experience when using stale data is possible, this development further decouples user 
experience from core metrics. 


Over the past 10 years, efforts by Dave Taht and the bufferbloat society have led to
significant progress in updating queuing algorithms to reduce latencies under load compared to
simpler FIFO queues. Unfortunately, the home router industry has yet to implement these
algorithms, mostly due to marketing and cost concerns. Most home router manufacturers
depend on System on a Chip (SoC) acceleration to create products with a desired throughput. SoC
manufacturers opt for simpler algorithms and aggressive aggregation, reasoning that a
higher-throughput chip will have guaranteed demand. Because consumers are offered choices
primarily among different high-throughput devices, the perception that a higher throughput
leads to a higher QoS continues to strengthen.


The home router is not the only place that can benefit from clearer indications of acceptable 
performance for users. Since users perceive the Internet via the lens of applications, it is 
important that we call upon application vendors to adopt solutions that stress lower latencies. 
Unfortunately, while bandwidth is straightforward to measure, responsiveness is trickier. Many 
applications have found a set of metrics that are helpful to their realm but do not generalize well 
and cannot become universally applicable. Furthermore, due to the highly competitive 
application space, vendors may have economic reasons to avoid sharing their most useful 
metrics. 


4.1.3. Introductory Talks - Key Points 


1. Measuring bandwidth is necessary but not sufficient on its own.

2. In many cases, Internet users don't need more bandwidth but rather need "better 
bandwidth", i.e., they need other connectivity improvements. 

3. Users perceive the quality of their Internet connection based on the applications they use, 
which are affected by a combination of factors. There's little value in exposing a typical user 
to the entire spectrum of possible reasons for the poor performance perceived in their 
application-centric view. 

4. Many factors affecting user experience are outside the users' sphere of control. It's unclear
whether exposing users to these other factors will help them understand the state of their
network performance. In general, users prefer simple, categorical choices (e.g., "good",
"better", and "best" options).


5. The Internet content market is highly competitive, and many applications develop their own 
"secret sauce". 


4.2. Metrics Considerations 


In the second agenda section, the workshop continued its discussion about metrics that can be 
used instead of or in addition to available bandwidth. Several workshop attendees presented 
deep-dive studies on measurement methodology. 


4.2.1. Common Performance Metrics 


Losing Internet access entirely is, of course, the worst user experience. Unfortunately, unless 
rebooting the home router restores connectivity, there is little a user can do other than 
contacting their service provider. Nevertheless, there is value in the systematic collection of 
availability metrics on the client side; these can help the user's ISP localize and resolve issues 
faster while enabling users to better choose between ISPs. One can measure availability directly 
by simply attempting connections from the client side to distant locations of interest. For 
example, Ookla's [Speedtest] uses a large number of Android devices to measure network and 
cellular availability around the globe. Ookla collects hundreds of millions of data points per day 
and uses these for accurate availability reporting. An alternative approach is to derive 
availability from the failure rates of other tests. For example, [FCC_MBA] and
[FCC_MBA_methodology] use thousands of off-the-shelf routers, with measurement software
developed by [SamKnows]. These routers perform an array of network tests and report 
availability based on whether test connections were successful or not. 
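The second approach, deriving availability from the outcomes of other tests, reduces to a success ratio. A minimal Python sketch follows; the function name and the sample data are hypothetical and are not taken from the FCC or SamKnows tooling.

```python
from typing import Iterable


def availability_from_tests(outcomes: Iterable[bool]) -> float:
    """Return the fraction of test connections that succeeded."""
    results = list(outcomes)
    if not results:
        raise ValueError("at least one test outcome is required")
    return sum(results) / len(results)


# Hypothetical day of hourly tests in which two connections failed.
hourly_outcomes = [True] * 22 + [False] * 2
print(f"availability: {availability_from_tests(hourly_outcomes):.1%}")
```

A ratio like this says nothing about why a test failed, so it complements rather than replaces direct connectivity probes.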


Measuring available capacity can be helpful to end users, but it is even more valuable for service 
providers and application developers. High-definition video streaming requires significantly 
more capacity than any other type of traffic. At the time of the workshop, video traffic constituted 
90% of overall Internet traffic and contributed to 95% of the revenues from monetization (via 
subscriptions, fees, or ads). As a result, video streaming services, such as Netflix, need to 
continuously cope with rapid changes in available capacity. The ability to measure available
capacity in real time allows the different adaptive bitrate (ABR) compression algorithms to
ensure the best possible user experience. Measuring aggregated capacity demand allows ISPs to
be ready for traffic spikes. For example, during the end-of-year holiday season, the global 
demand for capacity has been shown to be 5-7 times higher than during other seasons. For end 
users, knowledge of their capacity needs can help them select the best data plan given their 
intended usage. In many cases, however, end users have more than enough capacity, and adding 
more bandwidth will not improve their experience -- after a point, it is no longer the limiting 
factor in user experience. Finally, the ability to differentiate between the "throughput" and the 
"goodput" can be helpful in identifying when the network is saturated. 


In measuring network quality, latency is defined as the time it takes a packet to traverse a 
network path from one end to the other. At the time of this report, users in many places 
worldwide can enjoy Internet access that has adequately high capacity and availability for their 
current needs. For these users, latency improvements, rather than bandwidth improvements,
can lead to the most significant improvements in QoE. The established latency metric is the
round-trip time (RTT), commonly measured in milliseconds. However, users often find RTT values
unintuitive since, unlike other performance metrics, high RTT values indicate poor latency and
users typically understand higher scores to be better. To address this, [Paasch2021] and
[Mathis2021] present an inverse metric, called "Round-trips Per Minute" (RPM).
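Because RPM counts round trips per minute, it follows from an RTT in milliseconds by dividing 60,000 by that RTT. The Python sketch below shows the conversion; the function name and the sample RTT values are illustrative only.

```python
def rtt_ms_to_rpm(rtt_ms: float) -> float:
    """Convert a round-trip time in milliseconds to round trips per minute."""
    if rtt_ms <= 0:
        raise ValueError("RTT must be positive")
    return 60_000.0 / rtt_ms


# Illustrative RTT values: a lower RTT yields a higher (better) RPM score.
for rtt in (20.0, 100.0, 600.0):
    print(f"RTT {rtt:>5.0f} ms -> {rtt_ms_to_rpm(rtt):>6.0f} RPM")
```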


There is an important distinction between "idle latency" and "latency under working conditions". 
The former is measured when the network is underused and reflects a best-case scenario. The 
latter is measured when the network is under a typical workload. Until recently, typical tools 
reported a network's idle latency, which can be misleading. For example, data presented at the 
workshop shows that idle latencies can be up to 25 times lower than the latency under typical 
working loads. Because of this, it is essential to make a clear distinction between the two when 
presenting latency to end users. 


Data shows that rapid changes in capacity affect latency. [Foulkes2021] attempts to quantify how 
often a rapid change in capacity can cause network connectivity to become "unstable" (i.e., 
having high latency with very little throughput). Such changes in capacity can be caused by 
infrastructure failures but are much more often caused by in-network phenomena, like changing 
traffic engineering policies or rapid changes in cross-traffic. 


Data presented at the workshop shows that 36% of measured lines have capacity metrics that 
vary by more than 10% throughout the day and across multiple days. These differences are
caused by many variables, including local connectivity methods (Wi-Fi vs. Ethernet), competing
LAN traffic, device load/configuration, time of day, and local loop/backhaul capacity. Variations
in these factors make measuring capacity using only an end-user device or other end-network
measurement difficult. A network router seeing aggregated traffic from multiple devices
provides a better vantage point for capacity measurements. Such a test can account for the 
totality of local traffic and perform an independent capacity test. However, various factors might 
still limit the accuracy of such a test. Accurate capacity measurement requires multiple samples. 
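The report does not say how the "more than 10%" variation was computed. As one possible reading, the Python sketch below treats variation as the spread between the largest and smallest sample relative to the median, and then counts the share of lines above a threshold. The function names, the definition of variation, and the sample data are all assumptions made for illustration.

```python
from statistics import median
from typing import Mapping, Sequence


def capacity_variation(samples_mbps: Sequence[float]) -> float:
    """Relative spread of repeated capacity samples: (max - min) / median."""
    if len(samples_mbps) < 2:
        raise ValueError("capacity variation needs multiple samples")
    return (max(samples_mbps) - min(samples_mbps)) / median(samples_mbps)


def share_of_variable_lines(lines: Mapping[str, Sequence[float]],
                            threshold: float = 0.10) -> float:
    """Fraction of lines whose capacity variation exceeds the threshold."""
    variable = sum(1 for s in lines.values() if capacity_variation(s) > threshold)
    return variable / len(lines)


# Hypothetical capacity samples (Mbps) taken at different times of day.
lines = {
    "line-a": [100.0, 98.0, 101.0, 99.0],
    "line-b": [50.0, 42.0, 49.0, 51.0],
    "line-c": [20.0, 20.5, 19.8, 20.1],
}
print(f"lines varying by more than 10%: {share_of_variable_lines(lines):.0%}")
```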


As users perceive the Internet through the lens of applications, it may be difficult to correlate 
changes in capacity and latency with the quality of the end-user experience. For example, web 
browsers rely on cached page versions to shorten page load times and mitigate connectivity 
losses. In addition, social networking applications often rely on prefetching their "feed" items. 
These techniques make the core in-network metrics less indicative of the users' experience and
necessitate collecting data from the end-user applications themselves.


It is helpful to distinguish applications that operate on a "fixed latency budget" from
those that have more tolerance to latency variance. Cloud gaming serves as an example
application that requires a "fixed latency budget", as a sudden latency spike can decide the
"win/lose" ratio for a player. Companies that compete in the lucrative cloud gaming market make
significant infrastructure investments, such as building entire data centers closer to their users.
These investments indicate that the economic benefit of fewer latency spikes outweighs
the associated deployment costs. On the other hand, applications that are more tolerant to
latency spikes can continue to operate reasonably well through short spikes. Yet, even those
applications can benefit from consistently low latency depending on usage shifts. For example,
Video-on-Demand (VOD) apps can work reasonably well when the video is consumed linearly,
but once the user tries to "switch a channel" or to "skip ahead", the user experience suffers unless
the latency is sufficiently low.
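One simple way to reason about a "fixed latency budget" is to count how often measured latency exceeds it. The Python sketch below does this for a list of RTT samples; the 50 ms budget, the function name, and the samples are hypothetical and do not come from any workshop submission.

```python
from typing import Sequence


def budget_violation_rate(rtt_samples_ms: Sequence[float], budget_ms: float) -> float:
    """Fraction of RTT samples that exceed the application's latency budget."""
    if not rtt_samples_ms:
        raise ValueError("at least one RTT sample is required")
    violations = sum(1 for rtt in rtt_samples_ms if rtt > budget_ms)
    return violations / len(rtt_samples_ms)


# Hypothetical trace with two spikes against an assumed 50 ms budget.
trace_ms = [18, 22, 19, 140, 21, 20, 95, 23]
print(f"samples over budget: {budget_violation_rate(trace_ms, budget_ms=50):.0%}")
```

An application with a fixed budget would care about this rate even when the average latency looks healthy.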


Finally, as applications continue to evolve, in-application metrics are gaining in importance. For 
example, VOD applications can assess the QoE by application-specific metrics, such as whether 
the video player is able to use the highest possible resolution, identifying when the video is 
smooth or freezing, or other similar metrics. Application developers can then effectively use 
these metrics to prioritize future work. All popular video platforms (YouTube, Instagram, Netflix, 
and others) have developed frameworks to collect and analyze VOD metrics at scale. One 
example is the Scuba framework used by Meta [Scuba]. 


Unfortunately, in-application metrics can be challenging to use for comparative research 
purposes. First, different applications often use different metrics to measure the same 
phenomena. For example, application A may measure the smoothness of video via "mean time to 
rebuffer", while application B may rely on the "probability of rebuffering per second" for the 
same purpose. A different challenge with in-application metrics is that VOD is a significant 
source of revenue for companies, such as YouTube, Facebook, and Netflix, placing a proprietary 
incentive against exchanging the in-application data. A final concern centers on the privacy 
issues resulting from in-application metrics that accurately describe the activities and 
preferences of an individual end user. 
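To illustrate why such metrics are hard to compare, the Python sketch below computes both example metrics from the same hypothetical playback sessions. The definitions used here (watch time divided by rebuffer events, and rebuffer events divided by watch seconds) are assumptions; real applications may define these metrics differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PlaybackSession:
    watch_seconds: float   # time spent playing video
    rebuffer_events: int   # number of stalls during that time


def mean_time_to_rebuffer(sessions: Sequence[PlaybackSession]) -> float:
    """Average seconds of playback between rebuffer events (assumed definition)."""
    events = sum(s.rebuffer_events for s in sessions)
    watched = sum(s.watch_seconds for s in sessions)
    return float("inf") if events == 0 else watched / events


def rebuffer_probability_per_second(sessions: Sequence[PlaybackSession]) -> float:
    """Rebuffer events per second of playback, capped at 1 (assumed definition)."""
    watched = sum(s.watch_seconds for s in sessions)
    if watched <= 0:
        raise ValueError("sessions must contain playback time")
    return min(1.0, sum(s.rebuffer_events for s in sessions) / watched)


sessions = [PlaybackSession(600, 2), PlaybackSession(1200, 1), PlaybackSession(1800, 0)]
print(mean_time_to_rebuffer(sessions), rebuffer_probability_per_second(sessions))
```

Both numbers describe the same stalls, yet they use different units and scales, so results published by different applications cannot be compared without knowing the exact definitions.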


4.2.2. Availability Metrics 


Availability is simply defined as whether or not a packet can be sent and then received by its
intended recipient. Availability is naively thought to be the simplest metric to measure, but it is
more complex when considering that continual, instantaneous measurements would be needed to
detect the smallest of outages. Also difficult is determining the root cause of an outage: was the
user's line down, was something in the middle of the network at fault, or was it the service with
which the user was attempting to communicate?


4.2.3. Capacity Metrics 


If the network capacity does not meet user demands, the network quality will be impacted. Once 
the capacity meets the demands, increasing capacity won't lead to further quality improvements. 


The actual network connection capacity is determined by the equipment and the lines along the 
network path, and it varies throughout the day and across multiple days. Studies involving DSL 
lines in North America indicate that over 30% of the DSL lines have capacity metrics that vary by 
more than 10% throughout the day and across multiple days. 


Some factors that affect the actual capacity are: 


1. Presence of competing traffic, either in the LAN or in the WAN environments. In the LAN
setting, the competing traffic reflects the multiple devices that share the Internet connection.
In the WAN setting, the competing traffic often originates from the unrelated network flows
that happen to share the same network path.


2. Capabilities of the equipment along the path of the network connection, including the data
transfer rate and the amount of memory used for buffering.


3. Active traffic management measures, such as traffic shapers and policers that are often used 
by the network providers. 


There are other factors that can negatively affect the actual line capacities. 


User traffic demands follow the usage patterns and preferences of the particular users.
For example, large data transfers can use any available capacity, while media streaming
applications require only a limited capacity to function correctly. Videoconferencing applications
typically need less capacity than high-definition video streaming.


4.2.4. Latency Metrics 


End-to-end latency is the time that a particular packet takes to traverse the network path from 
the user to their destination and back. The end-to-end latency comprises several components: 


1. The propagation delay, which reflects the path distance and the individual link technologies
(e.g., fiber vs. satellite). The propagation delay doesn't depend on the utilization of the
network, to the extent that the network path remains constant.


2. The buffering delay, which reflects the time that segments spend in the memory of the network
equipment that connects the individual network links, as well as in the memory of the
transmitting endpoint. The buffering delay depends on the network utilization, as well as on
the algorithms that govern the queued segments.


3. The transport protocol delays, which reflect the time spent in retransmission and
reassembly, as well as the time spent when the transport is "head-of-line blocked".


4. The application delay, which reflects the inefficiencies in the application layer and was
explicitly called out by some of the workshop submissions.
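The components listed above can be kept as separate fields and combined into an end-to-end figure. The Python sketch below treats them as additive, which is a simplifying assumption; the class name, field names, and sample values are hypothetical and are not workshop measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    """Hypothetical per-component delays, in milliseconds."""
    propagation_ms: float   # path distance and link technology
    buffering_ms: float     # time queued in network equipment and endpoints
    transport_ms: float     # retransmission, reassembly, head-of-line blocking
    application_ms: float   # application-layer inefficiencies

    def end_to_end_ms(self) -> float:
        """Sum of all components (additivity is a simplifying assumption)."""
        return (self.propagation_ms + self.buffering_ms
                + self.transport_ms + self.application_ms)


# Only buffering and transport delays change between the two illustrative cases.
idle = LatencyBreakdown(12.0, 1.0, 0.0, 2.0)
loaded = LatencyBreakdown(12.0, 150.0, 20.0, 2.0)
print(idle.end_to_end_ms(), loaded.end_to_end_ms())
```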


Typically, end-to-end latency is measured when the network is idle. Results of such 
measurements mostly reflect the propagation delay but not other kinds of delay. This report uses 
the term "idle latency" to refer to results achieved under idle network conditions. 


Alternatively, if the latency is measured when the network is under its typical working 
conditions, the results reflect multiple types of delays. This report uses the term "working 
latency" to refer to such results. Other sources use the term "latency under load" (LUL) as a 
synonym. 


Data presented at the workshop reveals a substantial difference between the idle latency and the 
working latency. Depending on the traffic direction and the technology type, the working latency 
is between 6 to 25 times higher than the idle latency: 


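+============+============+=========+=========+============+============+
| Direction  | Technology | Working | Idle    | Working -  | Working /  |
|            | Type       | Latency | Latency | Idle       | Idle Ratio |
|            |            |         |         | Difference |            |
+============+============+=========+=========+============+============+
| Downstream | FTTH       | 148     | 10      | 138        | 15         |
+------------+------------+---------+---------+------------+------------+
| Downstream | Cable      | 103     | 13      | 90         | 8          |
+------------+------------+---------+---------+------------+------------+
| Downstream | DSL        | 194     | 10      | 184        | 19         |
+------------+------------+---------+---------+------------+------------+
| Upstream   | FTTH       | 207     | 12      | 195        | 17         |
+------------+------------+---------+---------+------------+------------+
| Upstream   | Cable      | 176     | 27      | 149        | 6          |
+------------+------------+---------+---------+------------+------------+
| Upstream   | DSL        | 686     | 27      | 659        | 25         |
+------------+------------+---------+---------+------------+------------+

                                Table 1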
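
The two derived columns of Table 1 follow directly from the two measured ones. As a minimal 
sketch (the row tuples simply restate the table, and the name derive is illustrative rather than 
taken from any workshop submission), they can be recomputed as follows: 

```python
# Rows restate Table 1: (direction, technology, working latency, idle latency).
ROWS = [
    ("Downstream", "FTTH", 148, 10),
    ("Downstream", "Cable", 103, 13),
    ("Downstream", "DSL", 194, 10),
    ("Upstream", "FTTH", 207, 12),
    ("Upstream", "Cable", 176, 27),
    ("Upstream", "DSL", 686, 27),
]


def derive(working, idle):
    """Return (working - idle, working / idle) for one table row."""
    return working - idle, working / idle


for direction, tech, working, idle in ROWS:
    diff, ratio = derive(working, idle)
    print(f"{direction:<10} {tech:<5} difference={diff:>3} ratio={ratio:.1f}")
```

The unrounded ratios span roughly 6.5 to 25.4; the table lists them as whole numbers. 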


While historically the tooling available for measuring latency focused on measuring the idle 
latency, there is a trend in the industry to start measuring the working latency as well, e.g., 
Apple's [NetworkQuality]. 


4.2.5. Measurement Case Studies 


Participants proposed several concrete methodologies for measuring network quality for end 
users. 


[Paasch2021] introduced a methodology for measuring working latency from the end-user 
vantage point. The suggested method incrementally adds network flows between the user device 
and a server endpoint until a bottleneck capacity is reached. Round-trip latency is then measured 
under these conditions and reported to the end user. The authors chose to report results with 
the RPM metric. The methodology has been implemented in Apple's macOS Monterey. 
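
As an illustration of the unit only, the sketch below converts a round-trip time into round 
trips per minute. It assumes the round-trip time is given in milliseconds and shows just the 
arithmetic implied by the metric's name; it does not reproduce the load-generating procedure of 
[Paasch2021], and the function name is hypothetical. 

```python
def rpm_from_rtt_ms(rtt_ms: float) -> float:
    """Convert a round-trip time in milliseconds to round trips per minute."""
    if rtt_ms <= 0:
        raise ValueError("round-trip time must be positive")
    # One minute is 60,000 ms, so this is how many round trips fit into it.
    return 60_000.0 / rtt_ms


print(rpm_from_rtt_ms(100.0))  # 600.0
```

Because the value grows as latency shrinks, a larger number always means a more responsive 
connection. 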


[Mathis2021] applied the RPM metric to the results of more than 4 billion download tests that 
M-Lab performed from 2010 to 2021. During this time frame, the M-Lab measurement platform 
underwent several upgrades that allowed the research team to compare the effect of different 
TCP congestion control algorithms (CCAs) on the measured end-to-end latency. The study showed 
that the use of the cubic CCA leads to increased working latency, which is attributed to its use 
of larger queues. 


[Schlinker2019] presented a large-scale study that aimed to establish a correlation between 
goodput and QoE on a large social network. The authors performed the measurements at 
multiple data centers from which video segments of set sizes were streamed to a large number of 
end users. The authors used the goodput and throughput metrics to determine whether 
particular paths were congested. 


[Reed2021] presented the analysis of working latency measurements collected as part of the 
Measuring Broadband America (MBA) program by the Federal Communications Commission 
(FCC). The FCC does not include working latency in its yearly report but does offer it in the raw 
data files. The authors used a subset of the raw data to identify important differences in the 
working latencies across different ISPs. 


[MacMillian2021] presented analysis of working latency across multiple service tiers. They found 
that, unsurprisingly, "premium" tier users experienced lower working latency compared to a 
"value" tier. The data demonstrated that working latency varies significantly within each tier; 
one possible explanation is the difference in equipment deployed in the homes. 


These studies have stressed the importance of measuring working latency. At the time of 
this report, many home router manufacturers rely on hardware-accelerated routing that uses 
FIFO queues. Measuring the working latency of these devices and making consumers aware of 
the effect of choosing one manufacturer vs. another can help improve the home router 
situation. The ideal test would be able to identify the working latency 
and pinpoint the source of the delay (home router, ISP, server side, or some network node in 
between). 


Another source of high working latency comes from network routers exposed to cross-traffic. As 
[Schlinker2019] indicated, these can become saturated during the peak hours of the day. 
Systematic testing of the working latency in routers under load can help improve both our 
understanding of latency and the impact of deployed infrastructure. 


4.2.6. Metrics Key Points 


The metrics for network quality can be roughly grouped into the following: 


1. Availability metrics, which indicate whether the user can access the network at all. 


2. Capacity metrics, which indicate whether the actual line capacity is sufficient to meet the 
user's demands. 


3. Latency metrics, which indicate if the user gets the data in a timely fashion. 


4. Higher-order metrics, which include both the network metrics, such as inter-packet arrival 
time, and the application metrics, such as the mean time between rebuffering for video 
streaming. 


The availability metrics can be seen as a derivative of either the capacity (zero capacity leading 
to zero availability) or the latency (infinite latency leading to zero availability). 


Key points from the presentations and discussions included the following: 


1. Availability and capacity are "hygienic factors" -- unless an application is capable of using 
extra capacity, end users will see little benefit from using over-provisioned lines. 


2. Working latency has a stronger correlation with the user experience than latency under an 
idle network load. Working latency can exceed the idle latency by an order of magnitude. 


3. The RPM metric is a stable metric, with higher values being better, that may be more 
effective when communicating latency to end users. 


4. The relationship between throughput and goodput can be effective in finding the saturation 
points, both in client-side [Paasch2021] and server-side [Schlinker2019] settings. 


5. Working latency depends on the algorithm choice for addressing endpoint congestion 
control and router queuing. 


Finally, it was commonly agreed that the best metrics are those that are actionable. 


4.3. Cross-Layer Considerations 


In the cross-layer segment of the workshop, participants presented material on and discussed 
how to accurately measure exactly where problems occur. Discussion centered especially on the 
differences between physically wired and wireless connections and the difficulties of accurately 
determining problem spots when multiple different types of network segments are responsible 
for the quality. As an example, [Kerpez2021] showed that the limited bandwidth of 2.4 GHz Wi-Fi 
is the most frequent bottleneck. In comparison, the wider bandwidth of 5 GHz Wi-Fi was the 
bottleneck in only 20% of observations. 


The participants agreed that no single component of a network connection has all the data 
required to measure the effects of the network performance on the quality of the end-user 
experience. 


* Applications that are running on the end-user devices have the best insight into their 
respective performance but have limited visibility into the behavior of the network itself and 
are unable to act based on their limited perspective. 


* ISPs have good insight into QoS considerations but are not able to infer the effect of the QoS 
metrics on the quality of end-user experiences. 


* Content providers have good insight into the aggregated behavior of the end users but lack 
the insight on what aspects of network performance are leading indicators of user behavior. 


The workshop identified the need for a standard and extensible way to exchange network 
performance characteristics. Such an exchange standard should address (at least) the following: 


* A scalable way to capture the performance of multiple (potentially thousands of) endpoints. 


* A data exchange format that prevents data manipulation so that the different 
participants won't be able to game the mechanisms. 


* Preservation of end-user privacy. In particular, federated learning approaches should be 
preferred so that no centralized entity has access to the whole picture. 


* A transparent model for giving the different actors on a network connection an incentive to 
share the performance data they collect. 


* An accompanying set of tools to analyze the data. 


4.3.1. Separation of Concerns 


Commonly, there's a tight coupling between collecting performance metrics, interpreting those 
metrics, and acting upon the interpretation. Unfortunately, such a model is not the best for 
successfully exchanging cross-layer data, as: 


* the actors that are able to collect particular performance metrics (e.g., the TCP RTT) do not 
necessarily have the context necessary for a meaningful interpretation, 


* the actors that have the context and the computational/storage capacity to interpret metrics 
do not necessarily have the ability to control the behavior of the network/application, and 


* the actors that can control the behavior of networks and/or applications typically do not 
have access to complete measurement data. 


The participants agreed that it is important to separate the above three aspects, so that: 


* the actors that have the data, but not the ability to interpret and/or act upon it, 
publish their measured data and 


* the actors that have the expertise in interpreting and synthesizing performance data 
publish the results of their interpretations. 


4.3.2. Security and Privacy Considerations 


Preserving the privacy of Internet end users is a difficult requirement to meet when addressing 
this problem space. There is an intrinsic trade-off between collecting more data about user 
activities and infringing on their privacy while doing so. Participants agreed that observability 
across multiple layers is necessary for an accurate measurement of the network quality, but 
doing so in a way that minimizes privacy leakage is an open question. 


4.3.3. Metric Measurement Considerations 


* The following TCP protocol metrics have been found to be effective and are available for 
passive measurement: 

   - TCP connection latency measured using selective acknowledgment (SACK) or 
   acknowledgment (ACK) timing, as well as the timing between TCP retransmission events, 
   are good proxies for end-to-end RTT measurements. 

   - On the Linux platform, the tcp_info structure is the de facto standard for an application to 
   inspect the performance of kernel-space networking. However, there is no equivalent de 
   facto standard for user-space networking. 


* The QUIC and MASQUE protocols make passive performance measurements more 
challenging. 

   - An approach that uses federated measurement/hierarchical aggregation may be more 
   valuable for these protocols. 

   - The QLOG format seems to be the most mature candidate for such an exchange. 


4.3.4. Towards Improving Future Cross-Layer Observability 


The ownership of the Internet is spread across multiple administrative domains, making 
measurement of end-to-end performance data difficult. Furthermore, the immense scale of the 
Internet makes aggregation and analysis of this data difficult. [Marx2021] presented a simple 
logging format that could potentially be used to collect and aggregate data from different layers. 


Another aspect of the cross-layer collaboration hampering measurement is that the majority of 
current algorithms do not explicitly provide performance data that can be used in cross-layer 
analysis. The IETF community could be more diligent in identifying each protocol's key 
performance indicators and exposing them as part of the protocol specification. 


Despite all these challenges, it should still be possible to perform limited-scope studies in order to 
have a better understanding of how user quality is affected by the interaction of the different 
components that constitute the Internet. Furthermore, recent development of federated learning 
algorithms suggests that it might be possible to perform cross-layer performance measurements 
while preserving user privacy. 


4.3.5. Efficient Collaboration between Hardware and Transport Protocols 


With the advent of Low Latency, Low Loss, and Scalable Throughput (L4S) congestion 
notification and control, there is an even higher need for the transport protocols and the 
underlying hardware to work in unison. 


At the time of the workshop, the typical home router uses a single FIFO queue that is large 
enough to allow amortizing the lower-layer header overhead across multiple transport PDUs. 
These designs worked well with the cubic congestion control algorithm, yet the newer generation 
of algorithms can operate on much smaller queues. To fully support latencies less than 1 ms, the 
home router needs to work efficiently on sequential transmissions of just a few segments vs. 
being optimized for large packet bursts. 
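
To make the sub-millisecond target concrete, the sketch below estimates how long a FIFO queue 
takes to drain and how many full segments fit into a given delay budget. The link rate and 
segment size used in the example are hypothetical values chosen for illustration; they were not 
reported at the workshop, and the calculation ignores lower-layer framing overhead. 

```python
def drain_time_ms(queued_bytes: int, link_rate_bps: float) -> float:
    """Time in milliseconds for a FIFO queue to drain at the given link rate."""
    return queued_bytes * 8 / link_rate_bps * 1000.0


def max_segments_within(budget_ms: float, segment_bytes: int,
                        link_rate_bps: float) -> int:
    """Largest number of full segments whose drain time fits in the budget."""
    per_segment_ms = drain_time_ms(segment_bytes, link_rate_bps)
    return int(budget_ms // per_segment_ms)


# Hypothetical example: 1500-byte segments on a 100 Mbit/s link.
print(max_segments_within(1.0, 1500, 100e6))  # 8
```

Under these assumed numbers, only a handful of queued segments already consume a 1 ms budget, 
which is why a queue sized for large bursts is at odds with such a latency goal. 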


Another design trait common in home routers is the use of packet aggregation to further 
amortize the overhead added by the lower-layer headers. Specifically, multiple IP datagrams are 
combined into a single, large transfer frame. However, this aggregation can add up to 10 ms to 
the packet sojourn delay. 


Following the famous "you can't improve what you don't measure" adage, it is important to 
expose these aggregation delays in a way that would allow identifying the source of the 
bottlenecks and making hardware more suitable for the next generation of transport protocols. 


4.3.6. Cross-Layer Key Points 


* Significant differences exist in the characteristics of metrics to be measured and the required 
optimizations needed in wireless vs. wired networks. 


* Identification of an issue's root cause is hampered by the challenges in measuring multi- 
segment network paths. 


* No single component of a network connection has all the data required to measure the 
effects of the complete network performance on the quality of the end-user experience. 


* Actionable results require both proper collection and interpretation. 


* Coordination among network providers is important to successfully improve the 
measurement of end-user experiences. 


* Simultaneously providing accurate measurements while preserving end-user privacy is 
challenging. 


* Passive measurements from protocol implementations may provide beneficial data. 


4.4. Synthesis 


Finally, in the synthesis section of the workshop, the presentations and discussions concentrated 
on the next steps likely needed to make forward progress. Of particular concern is how to bring 
forward measurements that can make sense to end users trying to select between various 
networking subscription options. 


4.4.1. Measurement and Metrics Considerations 


One important consideration is how decisions can be made and what actions can be taken based 
on collected metrics. Measurements must be integrated with applications in order to get true 
application views of congestion, as measurements over different infrastructure or via other 
applications may return incorrect results. Congestion itself can be a temporary problem, and 
mitigation strategies may need to be different depending on whether it is expected to be a short- 
term or long-term phenomenon. A significant challenge exists in measuring short-term 
problems, driving the need for continuous measurements to ensure critical moments and long- 
term trends are captured. For short-term problems, workshop participants debated whether an 
issue that goes away is indeed a problem or is a sign that a network is properly adapting and self- 
recovering. 


Care must be taken when constructing metrics so that the results can be understood. 
Measurements can also be affected by individual packet characteristics -- delay typically 
grows linearly with packet size. With this in mind, 
measurements can be divided into a delay based on geographical distances, a packet-size 
serialization delay, and a variable (noise) delay. Each of these three sub-component delays can be 
different and individually measured across each segment in a multi-hop path. Variable delay can 
also be significantly impacted by external factors, such as bufferbloat, routing changes, network 
load sharing, and other local or remote changes in performance. Network measurements, 
especially load-specific tests, must also be run long enough to ensure that any problems 
associated with buffering, queuing, etc. are captured. Measurement technologies should also 
distinguish between upstream and downstream measurements, as well as measure the 
difference between end-to-end paths and sub-path measurements. 
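
One possible way to separate the three sub-components from raw samples is sketched below. It 
assumes that the smallest delay observed at each packet size is essentially free of queuing 
noise, fits a straight line through those minima, and attributes whatever the line does not 
explain to variable delay. This minimum-filtering step and the function name are assumptions 
made for illustration, not a method presented at the workshop. 

```python
from collections import defaultdict


def decompose_delay(samples):
    """Split (packet_bytes, delay_ms) samples into fixed, per-byte, and variable parts.

    Returns (fixed_ms, ms_per_byte, variable_ms_per_sample).
    """
    # Keep the minimum delay seen for each packet size.
    minima = defaultdict(lambda: float("inf"))
    for size, delay in samples:
        minima[size] = min(minima[size], delay)
    if len(minima) < 2:
        raise ValueError("need at least two distinct packet sizes")

    # Ordinary least-squares line through the per-size minima.
    sizes = list(minima)
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(minima[s] for s in sizes) / n
    slope = sum((s - mean_x) * (minima[s] - mean_y) for s in sizes) / sum(
        (s - mean_x) ** 2 for s in sizes
    )
    fixed = mean_y - slope * mean_x

    # Whatever the line does not explain is attributed to variable delay.
    variable = [delay - (fixed + slope * size) for size, delay in samples]
    return fixed, slope, variable
```

The intercept approximates the distance-based delay, the slope approximates the serialization 
cost per byte, and the residuals show how much each sample was inflated by queuing or other 
transient effects. 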


4.4.2. End-User Metrics Presentation 


Determining end-user needs requires informative measurements and metrics. How do we 
provide the users with the service they need or want? Is it possible for users to even voice their 
desires effectively? End-user surveys typically yield only high-level, simplistic answers like 
"reliability", "capacity", and "service bundling". Technical requirements that operators 
can consume, like "low-latency" and "congestion avoidance", are not terms known to or used by 
end users. 


Example metrics useful to end users might include the number of users supported by a service 
and the number of applications or streams that a network can support. Example solutions to 
combat networking issues include incentive-based traffic management strategies (e.g., an 
application requesting lower latency may also mean accepting lower bandwidth). User-perceived 
latency must be considered, not just network latency -- users experience in-application to 
in-server latency, whereas network-to-network measurements may only be studying the 
lowest-level latency. Thus, picking the right protocol to use in a measurement is critical in 
order to match user experience (for example, users do not transmit data over ICMP, even though 
it is a common measurement tool). 


In-application measurements should consider how to measure different types of applications, 
such as video streaming, file sharing, multi-user gaming, and real-time voice communications. It 
may be that asking users what trade-offs they are willing to accept would be a helpful 
approach: would they rather have a network with low latency or a network with higher 
bandwidth? Gamers may make different decisions than home office users or content producers, 
for example. 


Furthermore, how can users make these trade-offs in a fair manner that does not impact other 
users? There is a tension between solutions in this space and the cost associated with solving 
these problems, as well as the question of which customers are willing to front these 
improvement costs. 


Challenges in providing higher-priority traffic to users center around the willingness of 
networks to listen to client requests for higher incentives, even though commercial interests 
may not flow to them without a cost incentive. Shared mediums in general are subject to 
oversubscribing, such that the number of users a network can support is either accurate on an 
underutilized network or may assume an average bandwidth or other usage metric that fails to 
be accurate during utilization spikes. Individual metrics are also affected by in-home devices 
from cheap routers to microwaves and by (multi-)user behaviors during tests. Thus, a single 
metric alone or a single reading without context may not be useful in assisting a user or operator 
to determine where the problem source actually is. 


User comprehension of a network remains a challenging problem. Multiple workshop 
participants argued for a single number (potentially calculated with a weighted aggregation 
formula) or a small number of measurements per expected usage (e.g., a "gaming" score vs. a 
"content producer" score). Many agreed that some users may instead prefer to consume 
simplified or color-coded ratings (e.g., good/better/best, red/yellow/green, or bronze/gold/ 
platinum). 
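
A weighted aggregation of this kind could look like the sketch below. The usage profiles, their 
weights, and the color thresholds are all hypothetical; the workshop did not agree on any 
particular formula. Each input metric is assumed to be pre-normalized to the range 0 to 1, with 
1 being best. 

```python
# Hypothetical per-usage weights; each profile's weights sum to 1.
PROFILES = {
    "gaming": {"latency": 0.7, "capacity": 0.1, "availability": 0.2},
    "content producer": {"latency": 0.2, "capacity": 0.7, "availability": 0.1},
}


def usage_score(profile: str, normalized: dict) -> float:
    """Weighted sum of metrics already normalized to [0, 1], where 1 is best."""
    weights = PROFILES[profile]
    return sum(weight * normalized[name] for name, weight in weights.items())


def color(score: float, yellow: float = 0.5, green: float = 0.8) -> str:
    """Map a score to a red/yellow/green rating using assumed thresholds."""
    if score >= green:
        return "green"
    if score >= yellow:
        return "yellow"
    return "red"


connection = {"latency": 0.9, "capacity": 0.4, "availability": 1.0}
for profile in PROFILES:
    print(profile, color(usage_score(profile, connection)))
```

With these assumed inputs, the same connection rates differently for the two usage profiles, 
which illustrates why a per-usage score may be more informative than one global number. 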


4.4.3. Synthesis Key Points 


* Some proposed metrics: 

   - Round-trips Per Minute (RPM) 

   - users per network 

   - latency 

   - 99% latency and bandwidth 


* Median and mean measurements are distractions from the real problems. 


* Shared network usage greatly affects quality. 


* Long measurements are needed to capture all facets of potential network bottlenecks. 


* Better-funded research in all these areas is needed for progress. 


* End users will best understand a simplified score or ranking system. 
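
The point about medians and means can be illustrated with a small sketch. The latency samples 
below are synthetic and chosen only to show the effect; the nearest-rank percentile definition is 
one of several in common use and is an assumption here. 

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies: 95 fast round trips and 5 slow ones.
samples = [20] * 95 + [400] * 5

print("median:", statistics.median(samples))
print("mean:  ", statistics.mean(samples))
print("99th:  ", percentile(samples, 99))
```

In this synthetic data set, the median hides the slow round trips entirely and the mean blurs 
them, while the 99th percentile exposes them. 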


5. Conclusions 


During the final hour of the three-day workshop, statements that the group deemed to be 
summary statements were gathered. Later, any statements that were in contention were 
discarded (listed further below for completeness). For this document, the authors took the 
original list and divided it into rough categories, applied some suggested edits discussed on the 
mailing list, and further edited for clarity and to provide context. 


5.1. General Statements 


1. Bandwidth is necessary but not alone sufficient. 


2. In many cases, Internet users don't need more bandwidth but rather need "better 
bandwidth", i.e., they need other improvements to their connectivity. 


3. We need both active and passive measurements -- passive measurements can provide 
historical debugging. 


4. We need passive measurements to be continuous, archivable, and queriable, including 
reliability/connectivity measurements. 


5. A really meaningful metric for users is whether their application will work properly or fail 
because of a lack of a network with sufficient characteristics. 


6. A useful metric for goodness must actually incentivize goodness -- good metrics should be 
actionable to help drive industries towards improvement. 


7. A lower-latency Internet, however achieved, would benefit all end users. 


5.2. Specific Statements about Detailed Protocols/Techniques 


1. Round-trips Per Minute (RPM) is a useful, consumable metric. 


2. We need a usable tool that fills the current gap between network reachability, latency, and 
speed tests. 


3. End users that want to be involved in QoS decisions should be able to voice their needs and 
desires. 


4. Applications are needed that can perform and report good quality measurements in order to 
identify insufficient points in network access. 


5. Research done by regulators indicates that users/consumers prefer a simple metric per 
application, which frequently resolves to whether the application will work properly or not. 


6. New measurements and QoS or QoE techniques should not rely only or depend on reading 
TCP headers. 


7. It is clear from developers of interactive applications and from network operators that lower 
latency is a strong factor in user QoE. However, metrics are lacking to support this statement 
directly. 


5.3. Problem Statements and Concerns 


1. Latency means and medians are distractions from better measurements. 


2. It is frustrating to only measure network services without simultaneously improving those 
services. 


3. Stakeholder incentives aren't aligned for easy wins in this space. Incentives are needed to 
motivate improvements in public network access. Measurements may be one step towards 
driving competitive market incentives. 


4. For future-proof networking, it is important to measure the ecological impact of material 
and energy usage. 


5. We do not have incontrovertible evidence that any one metric (e.g., latency or speed) is more 
important than others to persuade device vendors to concentrate on any one optimization. 


5.4. No-Consensus-Reached Statements 


Additional statements were discussed and recorded that did not have consensus of the group at 
the time, but they are listed here for completeness: 


1. We do not have incontrovertible evidence that bufferbloat is a prevalent problem. 


2. The measurement needs to support reporting localization in order to find problems. 
Specifically: 
   - Detecting a problem is not sufficient if you can't find the location. 
   - Need more than just English -- different localization concerns. 


3. Stakeholder incentives aren't aligned for easy wins in this space. 


6. Follow-On Work 


There was discussion during the workshop about where future work should be performed. The 
group agreed that some work could be done more immediately within existing IETF working 
groups (e.g., IPPM, DetNet, and RAW), while other longer-term research may be needed in IRTF 
groups. 


7. IANA Considerations 


This document has no IANA actions. 


8. Security Considerations 


A few security-relevant topics were discussed at the workshop, including but not limited to: 


* what prioritization techniques can work without invading the privacy of the communicating 
parties and 


* how oversubscribed networks can essentially be viewed as a DDoS attack. 